SMART Attributes For Predicting HDD Failure

“Know Thyself”, Socrates once said. While this Greek philosopher didn’t have hard drives in his audience, he may as well have. Modern HDDs are able to capture data about their own performance in the form of SMART attributes.

“S.M.A.R.T.” stands for “Self-Monitoring, Analysis, and Reporting Technology”. As with humans, self-knowledge isn’t the end of a hard drive’s road to wisdom: that knowledge has to be applied. Manufacturers pair each SMART attribute with a threshold in order to pinpoint hardware issues. This makes SMART stats important tools for monitoring HDD health.

The road to hardware wisdom isn’t always easy. SMART attributes can be opaque: they vary between manufacturers, who don’t always make the units of measurement public. However, there are some clever ways to use SMART attributes to predict and manage hard drive failure.

A Brief History Of Self-Monitoring Technology

The first self-monitoring technology for hard drives appeared in 1992, in IBM’s 9337 disk arrays, which used IBM 0662 SCSI-2 disk drives. The monitoring technique was later named “Predictive Failure Analysis”, or “PFA”. In this early version, no measurements were transmitted: just a binary signal of “drive ok” or “drive not ok”.

Fast forward a few years, and Compaq, Seagate, Quantum, and Conner teamed up on a system known as “IntelliSafe”. This technology allowed particular measured values to be sent to the operating system. In 1995, the same group submitted this technology to the Small Form Factor committee, which was later absorbed by the Storage Networking Industry Association, or SNIA. The standardization initiative was supported by IBM and Western Digital. From there, the rest is history.

What Are SMART Attributes?

Firstly, we need to distinguish SMART attributes from the Self-Monitoring, Analysis, and Reporting Technology itself. Technically, S.M.A.R.T. is a way of sending status information from your hard drive to the surrounding system. The information transmitted (the SMART attributes) sometimes includes actual measurements, such as drive temperature. It also tells you whether the measured attribute has passed a certain threshold, indicating a problem. For example, S.M.A.R.T. might tell you when a drive’s internal temperature has passed 122°F (50°C), which can lead to failure in many hard drives. In short: for each attribute, the message “threshold exceeded” is usually bad, and “threshold not exceeded” isn’t.

SMART attributes, as displayed on the CrystalDiskInfo software. The raw values are expressed as hexadecimal numbers.
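Since tools such as CrystalDiskInfo can show raw values in hexadecimal, it may help to convert them to decimal before comparing them. The sketch below is a minimal illustration: the helper name `raw_hex_to_int` and the sample strings are made up for this example and do not come from a real drive. Bear in mind that some attributes may pack several fields into one raw value, so the decimal number is not always meaningful on its own.

```python
def raw_hex_to_int(raw_hex: str) -> int:
    """Convert a hexadecimal raw value string to a decimal integer."""
    # int() with base 16 accepts leading zeros and upper- or lower-case digits.
    return int(raw_hex, 16)


# Illustrative strings only, in the zero-padded style some tools display.
print(raw_hex_to_int("000000000018"))  # 24
print(raw_hex_to_int("00000000A3F2"))  # 41970
```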

Every drive manufacturer determines which parameters are to be monitored, and chooses the parameter values at which a “threshold exceeded” message should be sent. What drives have in common is the standardized technology which allows signals to be sent from internal sensors to the operating environment. What differs are the parameters measured, the units of measurement, and the thresholds which indicate “bad” values.
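To make the attribute-plus-threshold idea concrete, here is a minimal sketch of how one attribute might be modelled. The class and field names are hypothetical, and the sample numbers are invented. It assumes the convention commonly used for ATA drives, where the drive reports a normalized value that falls as health worsens, and the attribute counts as “threshold exceeded” once that value drops to or below the manufacturer’s threshold. Your drive or tool may behave differently.

```python
from dataclasses import dataclass


@dataclass
class SmartAttribute:
    attr_id: int    # attribute number, e.g. 5
    name: str       # human-readable label (varies by tool)
    value: int      # normalized value reported by the drive
    threshold: int  # manufacturer-set threshold
    raw: int        # vendor-specific raw value

    def threshold_exceeded(self) -> bool:
        # Assumption: normalized values fall as health worsens, and a
        # threshold of 0 means the attribute is never meant to trip.
        return self.threshold > 0 and self.value <= self.threshold


# Invented sample readings, for illustration only.
healthy = SmartAttribute(5, "Reallocated Sectors Count", value=100, threshold=36, raw=0)
worn = SmartAttribute(5, "Reallocated Sectors Count", value=30, threshold=36, raw=1840)
print(healthy.threshold_exceeded())  # False
print(worn.threshold_exceeded())     # True
```

Note that the check never needs to know the units behind the raw value, which is the point of having a manufacturer-set threshold.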

When it comes to S.M.A.R.T., not all interfaces are created equal. The standards documentation for S.M.A.R.T. was originally developed for ATA drives. Software tools can help you receive SMART data over SATA, SAS, or even NVMe interfaces. However, some interfaces, such as USB, prove particularly tricky.

Some Examples of SMART Attributes

What are the drive stats that S.M.A.R.T. is used to capture? The short answer is that these attributes are whatever the drive manufacturer wants them to be. A Seagate drive will collect information on different parameters than a Western Digital or Toshiba drive.

The folk at NTFS have a handy guide to different SMART attributes, though they emphasize that definitions and measurements vary by manufacturer. Some common SMART attributes include:

S.M.A.R.T. is useful, but not infallible. If your hard drive is currently on fire, you’re unlikely to find its temperature by consulting the relevant SMART attribute. Also, like any other technology, implementations of S.M.A.R.T. may be poorly designed, or prone to error in the event that an internal sensor fails. Finally, S.M.A.R.T. can only help detect predictable failures: damage due to a sudden power surge or a deliberate strike with a mallet is outside of its purview.

In other words, S.M.A.R.T. may be more useful at predicting aggregate reliability than it is at predicting the failure of individual drives.

SMART attributes can be used for more than just predicting drive failure. For example, when buying a factory-recertified drive, you can confirm that the drive has few or no power-on hours by checking SMART 09.
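As a rough illustration of the power-on hours check, the sketch below turns a SMART 09 raw value into something easier to read. It assumes the raw value is a plain count of hours, which is common but not guaranteed, since units can vary by manufacturer. The function names and the 24-hour cut-off are hypothetical choices for this example, not a standard.

```python
def describe_power_on_hours(raw_hours: int) -> str:
    """Summarize a power-on hours reading, assuming the raw value is in hours."""
    days, hours = divmod(raw_hours, 24)
    years = raw_hours / (24 * 365)
    return f"{raw_hours} h = {days} d {hours} h (about {years:.2f} years)"


def is_low_usage(raw_hours: int, limit_hours: int = 24) -> bool:
    """Return True if the drive has run for no more than limit_hours."""
    # The default limit is an arbitrary illustrative cut-off.
    return raw_hours <= limit_hours


print(describe_power_on_hours(8760))  # 8760 h = 365 d 0 h (about 1.00 years)
print(is_low_usage(8760))             # False
print(is_low_usage(3))                # True
```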

Top Secret: The Hidden Side of SMART Stats

SMART attributes can be tricky. Firstly, the particular parameters measured can vary by manufacturer. Secondly, even when two manufacturers measure the same parameter, they may use different units. The final source of potential confusion is that the units of measurement are often hidden. In the context of SMART attributes, “raw values” refer to the numbers associated with each measurement. Even if you know what raw value an attribute has, interpreting that number takes a lot of guesswork, if it can be interpreted at all.

For example, while SMART 04 directly counts the number of spindle start/stop cycles, SMART 220 measures, with some unknown unit of measurement (which may itself vary by manufacturer), the distance that a platter has shifted with respect to the spindle.

However, limited information about an attribute is a lot better than no information at all. Again, SMART 220 is illustrative. Even if you don’t know how to convert the numerical value provided into the millimeter displacement of a shifted platter, you know that a lower number is better, as this indicates a smaller shift. Since manufacturers also pair attributes with thresholds, you can sometimes identify problems even if you don’t know the units of measurement in question.

Which SMART Attributes Matter?

When in doubt, look at the thresholds which come with each attribute. These thresholds are set by the manufacturer, who is likely to know best which measured values indicate a healthy drive. For instance, a ruggedized drive may be designed to operate safely under higher temperatures than a normal HDD, and manufacturer thresholds will indicate that.

However, not all SMART attributes are equally useful in detecting problems. For example, while power-on hours may indicate a certain amount of drive wear, they won’t signal an imminent failure. Even stats which are correlated with failure may not be much help for prediction, as correlation is a matter of degree. For example, a drive exceeding its temperature threshold can lead to problems, but it makes more sense to examine a different SMART attribute to see whether it actually did.

So which attributes should you look at? Over years of statistical sleuthing, cloud storage and data backup company Backblaze has found the following five attributes most useful in predicting failure. If one of the raw values below is above zero, there may be an issue.

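Attributes which Backblaze has found most indicative of HDD failure: SMART 5 (Reallocated Sectors Count), SMART 187 (Reported Uncorrectable Errors), SMART 188 (Command Timeout), SMART 197 (Current Pending Sector Count), and SMART 198 (Uncorrectable Sector Count).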

The firm reported in 2016 that in 76.7% of drive failures, one or more of the attributes above had a raw value greater than zero.
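The “raw value above zero” rule of thumb is simple enough to sketch in a few lines. The example below uses the five attribute numbers Backblaze published (SMART 5, 187, 188, 197, and 198). The names follow common usage and may be labelled differently by your tool, and the function name and sample readings are invented for illustration. A hit is a prompt to look closer, not proof that the drive is about to fail.

```python
# Attribute IDs as published by Backblaze; names may differ between tools.
BACKBLAZE_ATTRIBUTES = {
    5: "Reallocated Sectors Count",
    187: "Reported Uncorrectable Errors",
    188: "Command Timeout",
    197: "Current Pending Sector Count",
    198: "Uncorrectable Sector Count",
}


def flag_warning_attributes(raw_values: dict[int, int]) -> list[str]:
    """Return a message for each watched attribute whose raw value is above zero."""
    return [
        f"SMART {attr_id} ({name}): raw value {raw_values[attr_id]}"
        for attr_id, name in BACKBLAZE_ATTRIBUTES.items()
        # Attributes the drive does not report are treated as zero.
        if raw_values.get(attr_id, 0) > 0
    ]


# Invented readings: attribute ID -> raw value.
sample = {5: 0, 9: 12000, 187: 2, 197: 0}
for warning in flag_warning_attributes(sample):
    print(warning)  # SMART 187 (Reported Uncorrectable Errors): raw value 2
```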

How to Check SMART Attributes

There are several apps of varying quality which you can use to read SMART data from your hard drive. Free apps include Smartmontools, used by Backblaze, and CrystalDiskInfo. Paid options include Passmark and SMARTHDD.

Manufacturers also sometimes provide applications which let you view the SMART attributes of your drives, and may also include additional tools for indicating drive health. Finally, if you have a Windows-equipped device, you can quickly check whether a drive is reporting a problem by typing “wmic diskdrive get model, status” into the command prompt.
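If you prefer to read SMART data from a script, one option is to call Smartmontools’ `smartctl` command and parse its JSON output. The sketch below assumes smartmontools 7.0 or later (which added the `--json` option), an ATA or SATA drive, and sufficient privileges to query the device. The function names are hypothetical, and the JSON fields used (`ata_smart_attributes`, `table`, `thresh`, `raw`) reflect the layout smartctl emits for ATA drives as far as we know; check the output on your own system before relying on it.

```python
import json
import subprocess


def parse_smartctl_json(text: str) -> list[dict]:
    """Extract the ATA SMART attribute table from smartctl's JSON output."""
    report = json.loads(text)
    # Non-ATA drives (e.g. NVMe) use a different layout; return [] for them.
    table = report.get("ata_smart_attributes", {}).get("table", [])
    return [
        {
            "id": row["id"],
            "name": row["name"],
            "value": row["value"],
            "threshold": row["thresh"],
            "raw": row["raw"]["value"],
        }
        for row in table
    ]


def read_smart_attributes(device: str) -> list[dict]:
    """Run smartctl on a device path and return its parsed attributes."""
    # smartctl's exit status is a bitmask of findings, so a non-zero code
    # does not necessarily mean the command failed; check=True is not used.
    result = subprocess.run(
        ["smartctl", "--json", "-A", device],
        capture_output=True,
        text=True,
    )
    return parse_smartctl_json(result.stdout)
```

On a Linux machine you might call `read_smart_attributes("/dev/sda")`, where the device path is only an example. Keeping the parsing separate from the subprocess call makes it possible to test the parser against saved output without touching a real drive.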

Seen one way, machine learning is just automated statistics at scale with a moving target. In other words, it is well suited to working out how the behavior of an ever-changing population of drives can be predicted using SMART attributes. According to one recent paper, SMART attributes, with a little help from machine learning, can also be used to predict short-term drive health (30-90 days) and long-term health (roughly 3 years).

Conclusion

SMART attributes are subtle. They aren’t always reliable, and the fact that manufacturers use different measurements and definitions of parameters can make the raw values difficult to interpret. However, for those managing large pools of hard drives, SMART attributes can be a very handy way to gain valuable insights in order to diagnose or prevent hardware issues. No one said self-knowledge is easy, but self-monitoring technology can put you and your drives on the path to hardware wisdom.

All drives fail eventually, and when they do it helps to have a plan. Find out how Horizon can provide expert support for your hardware lifecycle management needs.
